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Abstract — This  paper  shows  how  the  technique  of  retiming  can  be  used  to  transform  a 
given  synchronous  circuit  into  a  more  efficient  circuit  under  a  variety  of  different  cost  criteria. 
We  model  a  circuit  as  a  graph,  and  we  give  an  0(|V|  |£|  Ig  |V|)  algorithm  for  determining 
an  equivalent  circuit  with  the  smallest  possible  clock  period.  We  show  that  the  problem  of 
determining  an  equivalent  retimed  circuit  with  minimum  state  (total  number  of  registers) 
is  polynomial-time  solvable.  This  result  yields  a  polynomial-time  optimal  solution  to  the 
problem  of  pipelining  combinational  circuitry  with  minimum  register  cost.  We  also  give 
a  characterization  of  optimal  retiming  based  on  an  efficiently  solvable  mixed-integer  linear 
programming  problem.  - y 


Keywords:  combinatorial  optimization,  digital  circuitry,  graph  theory,  linear  program¬ 
ming,  network  flow,  pipelining,  retiming. 


1.  Introduction 

The  goal  of  VLSI  design  automation  is  to  speed  the  design  of  a  system  without  sacrificing 
the  quality  of  implementation.  A  common  means  of  achieving  this  goal  is  through  the  use  of 
optimization  tools  that  improve  the  quality  of  a  quickly  designed  circuit.  In  this  paper  we 
show  how  to  optimize  clocked  circuits  by  relocating  registers  so  as  to  reduce  combinational 
rippling.  Unlike  pipelining,  this  technique,  which  we  call  retiming,  does  not  increase  circuit 
latency. 

In  order  to  illustrate  retiming,  consider  the  problem  of  designing  a  digital  correlator.  The 
correlator  takes  a  stream  of  bits  Xo,Xx,xj,...  as  input  and  compares  it  with  a  fixed-length 

This  research  was  supported  in  part  by  the  Defense  Advanced  Research  Projects  Agency  under  Contract 
N0OO14-8O-C-OG22  and  by  the  Office  of  Naval  Research  under  Contract  N00014-76-C-0370.  Charles  Leiserson 
was  supported  in  part  by  an  NSF  Presidential  Young  Investigator  Award.  Most  of  the  results  reported  here  were 
obtained  while  James  Saxe  was  in  the  Computer  Science  Department  at  Carnegie-Mellon  University.  During 
part  of  this  period,  he  was  supported  by  an  IBM  graduate  fellowship. 


pattern  oo,  oi, . . . ,  a*.  After  receiving  each  input  x*  (*  >  k),  the  correlator  produces  as  output 
the  number  of  matches 


Vi  =  2*  (*<-»»«>•)  . 
y«o 


(1) 


where  6  is  the  comparison  function 


l.  if  *  =  y; 

otherwise. 


Figure  1  shows  a  design  of  a  simple  correlator  for  the  case  when  k  =  3.  Correlator  1  consists 
of  two  kinds  of  functional  elements,  adders  and  comparators,  whose  I/O  characteristics  are 
shown  in  the  figure.  The  boxes  between  the  comparators  are  registers  which  act  to  shift  the 
Xi  to  the  right  down  the  length  of  the  correlator.  On  each  tick  of  the  global  clock,  each  x<  is 
compared  with  a  character  of  the  pattern,  and  the  adders  sum  up  the  number  of  matches. 


P 


Figure  1.  Correlator  1:  A  aimple  circuit  made  of  two  kinds  of  functional  elements.  Each  com¬ 
parator  6  has  a  propagation  delay  of  3  esec,  and  each  adder  +  has  a  propagation  delay  of  7  esec. 

A  longest  path  of  combinational  rippling  starts  at  the  register  on  the  connection  labeled  A,  and 
thus  the  clock  period  of  the  circuit  is  24  esec. 

This  design,  though  easy  to  understand,  has  poor  performance.  Between  ticks  of  the  clock, 
the  partial  sums  of  the  matches  ripple  up  the  length  of  the  correlator.  Suppose,  for  instance, 
that  each  adder  has  a  propagation  delay  of  7  esec,1  and  each  comparator  has  a  propagation 
delay  of  3  esec.  Then  the  clock  period  must  be  at  least  24  esec — the  time  for  a  signal  to 

iR«cnll  (hat  on*  cptocecond  (ssec)  equals  on*  on*-sillionth  of  a  second. 


propagate  from  the  register  on  the  connection  labeled  A  through  one  comparator  and  three 
adders. 

A  design  that  gives  better  performance  can  be  derived  by  removing  the  register  on  connec¬ 
tion  A  from  Correlator  1  and  inserting  a  new  register  on  connection  B,  as  shown  in  Figure  2. 
To  show  that  these  two  correlators  are  indeed  functionally  equivalent,  consider  the  portion 
of  the  circuit  surrounded  by  the  dashed  box  in  the  figure.  It  communicates  with  the  rest  of 
the  circuit  only  through  connections  A  and  B.  When  the  register  on  A  is  removed,  all  input 
signals  to  this  portion  of  the  circuit  arrive  one  clock  tick  earlier,  and  thus  the  boxed  portion 
of  Correlator  2  performs  the  same  sequence  of  computations  as  in  Correlator  1,  but  one  clock 
tick  earlier.  Since  the  output  from  the  boxed  portion  of  Correlator  2  is  delayed  one  clock  tic1: 
by  the  new  register  on  connection  B,  the  remainder  of  the  circuit  sees  the  same  behavior  as  in 
Correlator  1.  We  say  that  the  three  functional  elements  in  the  boxed  portion  of  Correlator  2 
lead  by  one  clock  tick  the  corresponding  functional  elements  in  Correlator  1.  Alternatively,  we 
say  that  the  three  elements  in  Correlator  1  lag  by  one  clock  tick  the  corresponding  elements 
in  Correlator  2. 


Figure  2.  Correlator  2:  A  retimed  circuit  functionally  equivalent  to  Correlator  1  but  which  ia 
more  efficient.  The  longeat  path  of  combinational  rippling  begin*  at  the  register  on  connection  C, 
and  the  clock  period  of  this  circuit  ia  17  esec. 

Correlator  1  and  Correlator  2  are  functionally  equivalent,  but  the  performance  of  the  re¬ 
timed  circuit  Correlator  2  is  better  than  that  of  Correlator  1.  The  clock  period  of  Correlator  2 
is  17  esec — the  time  for  a  signal  to  propagate  from  the  register  on  connection  C  through  one 
comparator  and  two  adders.  Notice  that  the  two  designs  use  the  same  functional  elements 
connected  in  the  same  manner  and  differ  only  in  the  locations  of  registers.  Correlator  2 
has  the  I/O  characteristic  specified  by  Equation  (1),  but  it  should  be  apparent  that  a  direct 
verification  requires  considerably  more  effort  than  the  verification  of  Correlator  1. 

Retiming,  the  technique  of  inserting  and  deleting  registers  in  such  a  way  as  to  preserve 
function,  can  be  used  to  produce  an  even  faster  circuit  than  Correlator  2.  Section  4  gives 
an  implementation  of  the  correlator  that  achieves  a  clock  period  of  13.  Remarkably,  if  the 
pattern  of  comparators  and  adders  is  extended  arbitrarily  to  the  right,  a  clock  period  of  14 


can  always  be  achieved  by  retiming.  In  this  paper  we  exhibit  a  polynomial-time  algorithm 
for  determining  a  retiming  of  a  circuit  that  minimizes  clock  period. 

The  remainder  of  this  paper  is  organized  as  follows.  Section  2  presents  the  graph- theoretic 
model  of  synchronous  circuits  used  in  this  paper.  In  Section  3  we  formally  describe  the 
operation  of  retiming  [12],  in  which  registers  are  deleted  from  some  connections  of  a  circuit 
and  added  to  others  so  that  the  circuit  function  is  preserved.  Section  4  gives  a  simple 
polynomial-time  algorithm  for  minimizing  the  clock  period  of  a  circuit.  Section  5  gives  an 
asymptotically  more  efficient  algorithm  to  solve  the  same  problem.  In  Section  6  we  show  that 
the  problem  of  finding  a  retiming  of  a  circuit  that  minimizes  clock  period  can  be  reduced  to  an 
efficiently  solvable  mixed-integer  linear  programming  problem,  thus  providing  a  framework 
for  retiming  based  on  mathematical  programming. 

Sections  7,  8,  and  9  discuss  extensions  of  these  results.  Section  7  considers  the  special 
case  where  all  functional  elements  have  identical  propagation  delays  and  shows  that  optimal 
retimings  can  be  found  more  efficiently  in  this  case.  The  section  also  discusses  the  relationship 
of  this  work  to  systolic  computation  and  shows  how  to  improve  the  performance  of  many 
systolic  circuits  in  the  literature.  While  earlier  sections  are  concerned  with  finding  retimings 
that  are  optimal  in  the  sense  of  minimizing  clock  period,  Section  8  examines  a  different 
optimization  criterion,  namely  minimizing  the  total  amount  of  state  (number  of  registers) 
in  the  retimed  circuit.  In  particular,  we  show  that  the  problem  of  retiming  a  circuit  to 
minimize  its  state  subject  to  an  upper  bound  on  the  clock  period  can  be  reduced  to  the 
linear-programming  dual  of  a  minimum-cost  flow  problem,  and  hence  can  be  solved  optimally 
in  polynomial  time.  Section  9  extends  our  methods  to  a  more  general  circuit  model  in  which 
individual  functional  elements  may  have  nonuniform  propagation  delays — c.g.,  the  low-order 
output  bit  of  an  adder  may  be  available  earlier  than  the  high-order  bit. 

In  Section  10,  we  briefly  mention  further  extensions,  including  the  application  of  our 
algorithms  to  optimal  pipelining  of  combinational  circuitry. 

2.  Preliminaries 

In  this  section  we  define  the  notations  and  terminology  needed  in  the  paper  and  present 
our  graph-theoretic  model  of  digital  circuits.  We  conclude  by  giving  a  simple  algorithm  for 
determining  the  minimum  feasible  clock  period  of  a  circuit  from  its  graph. 

We  can  view  a  circuit  abstractly  as  a  network  of  functional  elements  and  globally  clocked 
registers.  The  registers  are  assumed  to  have  the  following  characteristics:  each  has  a  single 
input  and  a  single  output;  all  are  clocked  by  the  same  periodic  waveform;  and  at  each  clock 
tick,  each  storage  element  samples  its  input  and  the  sampled  value  is  made  available  at  the 
output  until  the  next  tick.  We  also  assume  that  changes  in  the  output  of  one  storage  element 
do  not  interfere  with  the  input  to  another  at  the  same  clock  tick.  An  example  of  such  a 
storage  element  is  an  edge-triggered,  master-slave,  D-type  flip-flop  [18]. 

The  functional  elements  provide  the  computational  power  of  the  circuit.  Our  model  is 
unconcerned  with  the  level  of  complexity  of  the  functional  elements — they  might  be  NAND 
gates,  multiplexors,  or  ALU’s,  for  example.  Each  functional  element  has  an  associated  prop - 


agation  delay.  The  outputs  of  a  functional  element  at  any  time  are  defined  as  a  specified 
function  of  its  inputs,  provided  that  all  the  inputs  have  been  stable  for  a  time  at  least  as 
long  as  its  propagation  delay.  We  make  the  conservative  assumption  that  when  an  input  to 
a  functional  element  changes,  the  outputs  may  behave  arbitrarily  until  they  settle  to  their 
final  values. 

To  be  precise,  we  model  a  circuit  as  a  finite,  vertex-weighted,  edge-weighted,  directed 
multigraph  G  =  (V,  E,  d,  w)  (henceforth,  we  shall  simply  say  "graph”  or,  more  frequently, 
“circuit”).  Figure  3  shows  the  graph  of  Correlator  1  from  Figure  1.  The  vertices  V  of  the 
graph  model  the  functional  elements  of  the  circuit.  Each  vertex  v  €  V  is  weighted  with  its 
numerical  propagation  delay  d(v).  The  directed  edges  E  of  the  graph  model  interconnections 
between  functional  elements.  Each  edge  e  €  E  connects  an  output  of  some  functional  element 
to  an  input  of  some  functional  element  and  is  weighted  with  a  register  count  w(e).  The  register 
count  is  the  number  of  registers  along  the  connection.3  Between  two  vertices,  there  may  be 
multiple  edges  with  different  register  counts. 


v,  v.  v. 


Figure  3.  The  graph  model  of  Correlator  1  from  Figure  1. 

Vertices  can  be  designated  to  represent  interfaces  with  the  external  world,  and  each  such 
vertex  is  given  zero  propagation  delay,  as  is  shown  for  vertex  v\  in  Figure  3.  (We  elaborate  on 
this  technicality  in  Section  10.)  If  the  relative  times  of  events  at  multiple  external  interfaces 
must  be  preserved,  we  treat  them  as  a  single  interface  and  represent  them  as  a  single  vertex 
with  multiple  incident  edges.  Otherwise,  we  assume  that  multiple  external  interfaces  are 
independent  of  each  other  and  cannot  communicate  with  each  other  externally.  For  most  of 
our  theory,  external  interfaces  can  be  handled  as  ordinary  vertices,  and  thus  our  formalism 
omits  them. 

We  shall  use  the  following  terminology  extensively.  To  avoid  confusion  between  vertex- 
weight  functions  such  as  the  propagation  delay  d  and  edge-weight  functions  such  as  the 
register  count  w,  we  shall  use  the  term  weight  for  edge-weight  functions  only.  In  fact,  the 

3  If  an  output  of  a  functional  clement  fane  out  to  more  than  one  other  functional  element,  the  tingle  inter¬ 
connection  can  be  treated,  without  loee  of  generality,  ae  eeveral  edges,  each  with  an  appropriate  weight.  Any 
optimisation  can  be  translated  from  the  model  back  to  a  circuit  with  fanout.  Section  8  examinee  fanout  more 
cloeely. 


only  vertex-weight  functions  we  use  are  the  propagation  delays  d(v),  and  in  general  we  shall 
refer  to  the  particular  edge  weights  w(e)  of  a  circuit  as  register  counts.  If  e  is  an  edge  in  a 
graph  that  goes  from  vertex  tt  to  vertex  v,  we  shall  use  the  notation  u  -2*  v.  In  the  event  that 
the  identity  of  either  the  head  or  the  tail  of  an  edge  is  unimportant,  we  shall  use  the  symbol 
?,  as  in  u  A?. 

For  a  graph  G,  we  shall  view  a  path  p  in  G  as  a  sequence  of  vertices  and  edges.  If  a  path 
p  starts  at  a  vertex  u  and  ends  at  a  vertex  v,  we  use  the  notation  u  v.  A  simple  path 
contains  no  vertex  twice,  and  therefore  the  number  of  vertices  exceeds  the  number  of  edges 
by  exactly  one. 

We  extend  the  register  count  function  w  in  a  natural  way  from  single  edges  to  arbitrary 
paths.  For  any  path  p  =  v0  ^  t>i  ^  v*,  we  define  the  path  weight  as  the  sum  of  the 

weights  of  the  edges  of  the  path: 

k-l 

w(p)  ~  5Z  • 

•*o 

Similarly,  we  extend  the  propagation  delay  function  d  to  simple  paths.  For  any  simple  path 
p  —  t>0  Vl  %  . . .  vk,  we  define  the  path  delay  as  the  sum  of  the  delays  of  the  vertices  of 
the  path: 

<*(p)  =  £<*(»<)  • 

isO 

In  order  that  a  graph  G  =  ( V ,  E,  d,  tv)  have  well-defined  physical  meaning  as  a  circuit,  we 
place  nonnegativity  restrictions  on  the  propagation  delays  d(v)  and  the  register  counts  w(e): 

Dl.  The  propagation  delay  d(v)  is  nonnegative  for  each  vertex  v  6  V. 

Wl.  The  register  count  w(e)  is  a  nonnegative  integer  for  each  edge  e  €  E. 

We  also  impose  the  restriction  that  there  be  no  directed  cycles  of  zero  weight: 

W2.  In  any  directed  cycle  of  G,  there  is  some  edge  with  (strictly)  positive  register  count. 

We  define  a  synchronous  circuit  as  a  circuit  that  satisfies  Conditions  Dl,  Wl,  and  W2. 
The  reason  for  including  Condition  W2  is  that  whenever  an  edge  e  between  two  vertices  u 
and  v  has  zero  weight,  a  signal  entering  vertex  u  can  ripple  unhindered  through  vertex  u 
and  subsequently  through  vertex  v.  If  the  rippling  can  feed  back  upon  itself,  problems  of 
asynchronous  latching,  oscillation,  and  race  conditions  can  arise.  By  prohibiting  zero-weight 
cycles,  Condition  W2  prevents  these  problems  from  occurring,  provided  that  the  s‘ystem 
clock  runs  slowly  enough  to  allow  the  outputs  of  ail  the  functional  elements  to  settle  between 
each  two  consecutive  ticks. 

For  any  synchronous  circuit  G,  we  define  the  (minimum  feasible)  clock  period  $(G)  as  the 
maximum  amount  of  propagation  delay  through  which  any  signal  must  ripple  between  clock 
ticks.  Condition  W2  guarantees  that  the  clock  period  is  well  defined  by  the  equation 

♦(G)  =  max  (d(p) :  ur(p)  =  0}  . 

For  the  circuit  graph  in  Figure  3  the  clock  period  is  24,  which  corresponds  to  the  sum  of  the 
propagation  delays  along  the  path  vt  —>  vs  —>  —>  v7. 
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Determination  of  the  clock  period  4(G)  is  relatively  simple.  The  algorithm  we  presem 
here  is  similar  to  an  algorithm  that  forms  a  part  of  a  design  tool  developed  at  America:' 
Microsystems,  Inc.  [15]. 

Algorithm  CP  ( Compute  the  clock  period  of  a  circuit).  This  algorithm  computes  the  dock 
period  4(C)  for  a  synchronous  circuit  G  =  (V,  E ,  d, tv). 

1.  Let  G0  be  the  subgraph  of  G  that  contains  precisely  those  edges  e  with  register  count 
m{e)  =  0. 

2.  By  Condition  W2,  Go  is  acyclic.  Perform  a  topological  sort  on  Co,  totally  ordering  its 
vertices  so  that  if  there  is  an  edge  from  vertex  u  to  vertex  v  in  G0,  then  u  precedes  v 
in  the  total  order. 

3.  Go  through  the  vertices  in  the  order  defined  by  the  topological  sort.  On  visiting  each 
vertex  v,  compute  the  quantity  A(v)  as  follows: 

a.  If  there  is  no  incoming  edge  to  v,  set  A(v) «—  d(v). 

b.  Otherwise,  set  A(t/)  «—  d(v)  +  max  { A(u) :  u  A  v  and  u/(e)  =  o}. 

4.  The  clock  period  4(C)  is  max„6y  A(v).  | 

The  algorithm  works  because  for  each  vertex  v,  the  quantity  A(v)  equals  the  maximum  sum 
d(p)  of  vertex  delays  along  any  zero-weight  directed  path  p  in  G  such  that  ?  v.  The 
running  time  is  0(|£|). 

3.  Retiming 

Retiming  transformations  alter  the  clock  period  of  a  circuit  by  inserting  and  deleting  registers, 
but  without  otherwise  affecting  the  circuit’s  structure.  This  section  formally  defines  retiming 
and  proves  some  simple  properties  of  the  transformation. 

A  retiming  can  be  viewed  as  an  assignment  of  a  lag  to  each  vertex  in  a  circuit,  and  this  is 
how  we  shall  define  it  formally.  A  retiming  of  a  circuit  G  =  (V,  E,  d,  w)  is  an  integer-valued 
vertex-labeling  r  :  V  — *  Z.  The  retiming  specifies  a  transformation  of  the  original  circuit  in 
which  the  registers  are  added  and  removed  so  as  to  change  the  graph  C  into  a  new  graph 
Gr  =  (V,E,d,wr),  where  the  edge-weighting  wr  is  defined  for  an  edge  u  A  v  by  the  equation 

wr{e)  =  w(e)  +  r(v)  -  r(u)  .  (2) 

In  the  example  of  Figure  3,  the  retiming  that  assigns  —1  to  functional  elements  vs,  t/4,  and 
vs,  and  assigns  0  to  all  other  vertices,  yields  the  circuit  of  Figure  2. 

Equation  (2),  which  tells  how  retiming  affects  the  register  counts  of  edges,  extends  natu¬ 
rally  to  paths. 

Lemma  1.  Let  G  =  (V,  E,  d,  w)  be  a  synchronous  circuit,  and  let  r  be  a  retiming. 

Then  for  any  path  u  v  in  G,  we  have 


Proof.  Suppose  p  is  composed  of  vertices  and  edges  vo  t>i  v*.  We  have 

k— 1 

«>r(p)  =  52wr(ei) 

«*0 

fc-1 

=  H  (to(e<)  +  r(vt+1)  -  r(v,)) 

»=o 

=  fl  «(«*)  +  £  (r(w<+i)  “  r(®<)) 

»=0  »=0 

=  tu(p)  +  r(vk)  -  >*(vo) 
because  the  sum  on  the  right  telescopes.  | 

Corollary  2.  Let  G  —  (V,E,d,w)  be  a  synchronous  circuit,  and  letr  be  a  retiming 
on  the  vertices  of  G.  Then  for  any  cycle  p  in  G,  we  have  wr(p)  =  u/(p). 

Proof.  Immediate  from  Lemma  1. 1 

A  retiming  of  r  of  a  circuit  G  is  legal  if  the  retimed  graph  Gr  satisfies  Conditions  Wl 
and  W2.  An  arbitrary  assignment  of  lags  to  the  vertices  of  a  circuit  G  may  cause  the 
retimed  circuit  Gr  to  violate  Condition  Wl,  which  says  that  no  edge  may  have  a  negative 
register  count.  This  condition  must  be  checked  explicitly  in  order  to  ensure  that  a  retiming  is 
legal.  Interestingly  enough,  Condition  W2  need  not  also  be  checked  because  of  the  following 
consequence  of  Corollary  2. 

Corollary  3.  Let  G  =  (V,E,d,w)  be  a  synchronous  circuit,  and  let  r  be  a 
retiming  on  the  vertices  of  G  such  that  Gr  satisfies  Condition  Wl.  Then  r  is  a 
legal  retiming. 

Proof.  Since  the  propagation  delays  d  are  unaffected  by  retiming,  Condition  Dl  is  satisfied 
by  graph  Gr.  Since  Condition  Wl  is  true  by  supposition,  it  remains  only  to  show  that  Gr 
satisfies  Condition  W2.  Let  p  be  any  cycle  in  G.  We  must  show  that  p  includes  at  least  one 
edge  e  such  that  ui,(e)  >  0.  Since  graph  G  satisfies  Conditions  Wl  and  W2,  the  register 
count  tu(p)  of  the  cycle  in  G  must  be  positive.  But  by  Corollary  2,  the  register  count  wr(p) 
of  the  cycle  in  Gr  is  equal  to  u>(p)  and  is  therefore  positive.  Hence,  there  must  be  an  edge  on 
the  cycle  in  Gr  that  has  positive  register  count.  | 

To  conclude  this  section,  we  comment  that  it.  is  .necessary  to  prove  that  when  a  circuit  G 
is  retimed  to  produce  a  new  graph  G„  the  new^Ircuit  TS^uncttanally  equivalent,  as  seen  by 
the  external  world,  to  the  original — provided,  of  course,  that  Gr  satisfies  Conditions  Wl  and 
W2.  Such  a  proof  can  be  found  in  [12],  which  also  contains  a  technical  definition  of  the  term 
“equivalent.” 

Moreover,  we  can  show  that  retiming  is,  in  a  sense,  the  most  general  possible  method 
for  changing  the  register  counts  within  a  circuit  without  disturbing  the  circuit’s  function. 
Although  we  do  not  formally  prove  it  here,  we  outline  the  thread  of  reasoning.  (For  an 


example  of  a  similar  argument  used  to  prove  a  weaker  result,  see  the  proof  of  Theorem  3  in 
[12]).  Without  loss  of  generality,  assume  that  any  circuit  G  =  {V,E,d,  w)  under  discussion 
has  the  following  two  properties.  1.  Every  vertex  v  €  V  is  connected  by  a  path  to  some 
external  interface.  2.  Every  vertex  v  6  V  has  at  least  one  input.  (Otherwise,  v  computes  a 
constant  function.3)  Given  the  graph  of  such  a  circuit,  but  no  knowledge  of  what  functions 
are  computed  by  the  functional  elements,  it  is  impossible,  other  than  by  retiming,  to  alter  the 
register  counts  on  the  edges  and  be  assured  that  the  external  behavior  is  unchanged.  For  any 
relabeling  of  the  edge  weights  that  is  not  a  retiming,  an  adversary  can  specify  the  functional 
elements  in  such  a  way  that  the  new  circuit  behaves  differently  from  the  original  circuit.  We 
omit  the  details  of  this  argument. 

4.  An  algorithm  for  clock  period  minimization. 

This  section  presents  a  polynomial-time  algorithm  for  retiming  a  circuit  so  as  to  maximize 
performance.  Specifically,  we  solve  the  following  clock  period  minimization  problem:  Given 
a  circuit  graph  G  =  {V,E,  d,  w),  find  a  legal  retiming  r  of  G  such  that  the  clock  period  $(Gr) 
of  the  retimed  circuit  Gr  is  as  small  as  possible.  The  solution  of  this  problem  depends  on 
some  basic  results  from  combinatorial  optimization  and  graph  theory.  In  particular,  we  rely 
on  the  fact  that  the  following  linear  programming  problem  can  be  solved  efficiently. 

Problem  LP.  Let  S  be  a  set  of  m  linear  inequalities  of  the  form 

Xj  x<  fC  Oij  (3) 

on  the  unknowns  Xi,X], . . .  ,xn,  where  the  Oij  are  given  real  constants.  Determine 
feasible  values  for  the  unknowns  x,-,  or  determine  that  no  such  values  exist. 

Constraint  systems  in  which  each  equation  has  the  form  of  Inequality  (3)  have  been 
studied  extensively.  Any  such  system  of  linear  inequalities  can  be  satisfied — or  determined 
to  be  inconsistent — in  O(mn)  time  by  the  Bellman- Ford  algorithm  [8,  p.  74]. 

The  algorithm  for  minimizing  the  clock  period  of  a  circuit  is  based  on  an  alternative 
characterization  of  clock  period  in  terms  of  two  quantities  which  we  now  define: 

VF(u,v)  =  min  {w(p)  :  u  A*  v}  , 

D[u,v )  =  max  |d(p)  :  u  -A*  v  and  tt/(p)  =  W(u, v) }  . 

The  quantity  W(u,v)  is  the  minimum  number  of  registers  on  any  path  from  vertex  u  to 

vertex  v.  We  call  a  path  u  -A*  v  such  that  w(p)  =  W(u,v )  a  critical  path  from  u  to  v.  The 

quantity  D(u,v)  is  the  maximum  total  propagation  delay  on  any  critical  path  from  u  to  v. 
Both  quantities  are  undefined  if  there  is  no  path  from  u  to  v.  Figure  4  shows  the  values  for 
Correlator  I. 

3  In  the  graph  (Figure  1)  of  Correlator  1,  for  example,  we  do  not  use  functional  elements  to  input  the  constants 
a,,  but  have  instead  incorporated  them  into  the  comparators. 
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Figure  4.  Table*  showing  the  values  of  the  functions  W  and  D  for  Correlator  1.  The  quantity 
W( u,«)  is  the  number  of  registers  on  a  minimum-weight  path  from  u  to  v,  and  D(u,v)  is  the 
maximum  propagation  delay  along  any  such  critical  path.  The  distinct  entries  in  the  table  for  D 
include  all  possible  clock  periods  for  any  retiming  of  Correlator  1.  A  legal  retiming  r  has  a  clock 
period  less  than  c  it  and  only  if  Wr( u,  v)  >  0  wherever  D(u,  v)  >  c.  Circled  entries  in  the  table 
for  D  are  explained  in  the  last  paragraph  of  Section  4. 

Lemma  4.  Let  G  =  (V,E,d,w)  be  a  synchronous  circuit,  and  let  c  be  any  positive 
real  number.  The  following  are  equivalent: 

4.1.  *(G)  <  e. 

4.2.  For  all  vertices  u  and  v  in  V,  if  D(u,v)  >  c,  then  W(tt,v)  >  1. 


Proof.  (4.1  =>  4.2):  Suppose  $(67)  <  e,  and  let  u  and  v  be  vertices  in  V  such  that  D(u,v)  >  c. 
If  W[u,  u)  =  0,  then  there  exists  a  path  p  from  u  to  v  with  propagation  delay  d(p)  =  D(u,v), 
which  is  greater  than  c,  and  register  count  w(p)  =  W(u,v)  =  0.  Contradiction. 

(4.2  =>  4.1):  Suppose  4.2  holds,  and  let  u  v  be  any  zero-weight  path  in  G.  Then  we 
have  W(u,t/)  =  ti/(p)  =  0,  which  implies  <f(p)  <  D(u,v)  <  c.  | 

It  is  not  difficult  to  compute  W  by  solving  the  all-pairs  shortest-paths  problem  in  G. 
Common  ways  of  solving  this  problem  are  the  Floyd- Warshall  method  [8,  p.  86],  which  runs 
in  0(| VI3)  time,  and  Johnson’s  algorithm  [5],  which  runs  in  0(|V|  |J5|  +  |V|2  lg  |V|)  time  using 
the  Fibonacci  heap  data  structure  due  to  Fredman  and  Tarjan  [1].  The  basic  operations 
on  weights  used  by  these  algorithms  are  addition  and  comparison.  The  following  algorithm 
shows  that  with  a  suitably  chosen  weight  function,  an  all-pairs  shortest-paths  algorithm  can 
be  used  to  compute  both  W  and  D. 


Algorithm  WD  ( Compute  W  and  D).  Given  a  synchronous  circuit  G  =  (V,  E,  d,  w),  this 
algorithm  computes  !V(u,v)  and  D(u,v)  for  all  «,v  €  V  such  that  u  is  connected  to  v  in  G. 

1.  Weight  each  edge  u  -^?  in  E  with  the  ordered  pair  (u/(e),  -d(u)). 

2.  Using  the  weighting  from  Step  1,  compute  the  weight  of  the  shortest  path  joining  each 
connected  pair  of  vertices  by  solving  an  all-pairs  shortest-paths  algorithm.  (In  the 
all-pairs  algorithm,  add  two  weights  by  performing  componentwise  addition.  Compare 
weights  using  lexicographic  ordering.) 

3.  For  each  shortest-path  weight  (x,  y)  between  two  vertices  u  and  v,  set  W (u,  v)  «—  x 
and  D(u,v)  +-  d(v)  -  y.  | 

The  reason  that  W  and  D  are  important  is  that  they  behave  nicely  under  retiming. 
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Lemma  5.  Let  G  =  (V,  E,  d,  w)  be  a  sychronous  circuit,  and  let  W  and  D  be 
defined  on  G  by  the  equations  (4).  Let  r  be  a  legal  retiming  of  G,  and  let  Wr  and 
Dr  be  defined  analogously  on  Gr.  Then 

5.1.  a  path  p  is  a  critical  path  of  Gr  if  and  only  if  it  is  a  critical  path  of  G, 

5.2.  Wr(tt,  v)  =  +  r(v)  —  r(u)  for  all  connected  vertices  u,  v  €  V ,  and 

5.3.  Dr( u,v)  =  D(u,v )  for  all  connected  vertices  u,v  €  V . 

Proof.  Assertion  5.1  follows  from  Lemma  1  because  retiming  changes  the  weights  of  all 
paths  from  u  to  v  by  the  same  amount,  and  then  5.2  follows  immediately.  Assertion  5.3  is  a 
consequence  of  5.2  together  with  the  fact  that  retiming  does  not  alter  propagation  delays.  5 

The  next  result  is  a  corollary  to  Lemma  5  which  shows  that  the  range  of  D  contains  the 
clock  periods  of  all  circuits  obtainable  by  retiming  G.  In  Figure  4  the  20  distinct  values  in 
the  table  for  D  include  all  possible  clock  periods  for  any  retiming  of  Correlator  1. 

Corollary  0.  Let  G  =  (V,  E,  d,  w)  be  a  synchronous  circuit,  and  let  r  be  a  retiming 
of  G.  Then  the  clock  period  $(G>)  is  equal  to  D{u,v)  for  some  u,v  6  V. 

Proof.  By  the  definition  of  clock  period,  the  circuit  Gr  contains  some  zero-weight  path  u  v 
such  that  d(p)  =  $(G>),  and  thus  we  have  Wr( u,i>)  =  wr(p)  —  0.  Moreover,  no  zero-weight 
path  in  Gr  has  greater  propagation  delay  than  p,  which  implies  Dr(u,v)  =  d(p).  Hence,  by 
Lemma  5  we  have  $(Gr)  =  Dr(u,v)  =  D(u,v).  | 

Lemma  4  and  Lemma  5  also  allow  us  to  characterize  the  conditions  under  which  a  retiming 
produces  a  circuit  whose  clock  period  is  no  greater  than  a  given  constant. 

Theorem  7.  Let  G  =  (V,  E,d,w)  be  a  synchronous  circuit,  let  c  be  an  arbitrary 
positive  real  number,  and  let  r  be  a  function  from  V  to  the  integers.  Then  r  is  a 
legal  retiming  of  G  such  that  $(Gr)  <  c  if  and  only  if 

7.1.  r(u)  —  r(u)  <  u/(e)  for  every  edge  u  A  v  of  G,  and 

7.2.  r(u)  —  r( v)  <  W(ts>v}  —  1  for  all  vertices  u,t>  €  V  such  that  D(u,v)  >  c. 

Proof.  By  Corollary  3,  the  retiming  r  is  les^-i  if  and  only  if  Assertion  7.1  holds.  If  r  is  indeed 
a  legal  retiming  of  G,  then  by  Lemma  4  the  retimed  circuit  Gr  has  clock  period  $(G>)  <  c 
precisely  under  the  condition  that  Wr(u ,  v)  >  1  for  all  vertices  u,  v  €  V  such  that  Dr(ut  v)  >  c. 
Since  by  Lemma  5  we  have  ^(ujv)  =  W(u,v)  -1-  r(v)  —  r(u)  and  Dr(u,v)  =  L>(u,v),  this 
condition  is  equivalent  to  Assertion  7.2.  E 

Theorem  7  provides  the  basic  tool  needed  to  solve  the  clock  period  minimization  problem. 
Notice  that  the  constraints  on  the  unknowns  r(v)  in  the  theorem  are  linear  inequalities  involv¬ 
ing  only  differences  of  unknowns,  and  thus  we  have  an  instance  of  Problem  LP.4  Therefore, 
using  the  Bellman-Ford  algorithm  to  test  whether  a  retimed  circuit  exists  with  clock  period 
less  than  some  constant  c  takes  0(|V|S)  time  since  there  can  be  only  0(|V'|3)  inequalities. 

*  Actually,  we  have  the  integer  linear  programming  version  of  the  problem  became  the  unknown*  r(v)  are 
required  to  be  integer.  Since  the  value  on  the  right-hand  aide  of  each  equation  i*  integer,  however,  the  Bellman- 
Ford  algorithm  produce*  an  integer  optimal  aolntion  if  one  exiat*. 


We  now  present  an  algorithm  to  determine  a  retiming  for  a  circuit  G  such  that  the  clock 
period  of  the  retimed  circuit  is  minimized. 

Algorithm  OPT1  (Clock  period  minimization) .  Given  a  synchronous  circuit  G  =  (V,  E,  d,  w) , 
this  algorithm  determines  a  retiming  r  such  that  4(G>)  is  as  small  as  possible. 

1.  Compute  W  and  D  using  Algorithm  WD. 

2.  Sort  the  elements  in  the  range  of  D. 

3.  Binary  search  among  the  elements  D(u,v)  for  the  minimum  achievable  clock  period. 
To  test  whether  each  potential  clock  period  e  is  feasible,  apply  the  Bellman-Ford 
algorithm  to  determine  whether  the  conditions  in  Theorem  7  can  be  satisfied. 

4.  For  the  minimum  achievable  clock  period  found  in  Step  3,  use  the  values  for  the  r(v) 
found  by  the  Bellman-Ford  algorithm  as  the  optimal  retiming.  | 

Algorithm  OPT1  runs  in  0(|V|S  lg  |V|)  time.  For  some  circuits,  we  can  sometimes  improve 
the  performance  of  Algorithm  OPTl  by  using  a  smaller  set  of  inequalities.  (An  algorithm  with 
provably  better  asymptotic  performance  is  given  in  the  next  section.)  The  key  observation 
is  that  we  may  eliminate  any  inequality  r(u)  -  r(v)  <  W(u,v)  —  1  from  Item  7.2  if  either 
D(u,  v)  —  d(v)  >  c  or  D(u,  v)  —  d(u)  >  c.  The  intuition  behind  this  optimisation  is  that  there 
is  no  need  to  explicitly  require  a  path  p  to  have  positive  weight  u;r(p)  if  we  already  require 
some  subpath  of  p  to  have  positive  weight. 


Figure  5.  The  graph  model  of  an  optimal  correlator  with  clock  period  13.  The  circuit  is  obtained 
from  the  graph  of  Correlator  1  in  Figure  3  by  applying  the  optimal  retiming,  determined  by 
Algorithm  OPTl.  For  each  vertex  v,  the  value  r(v)  is  the  lag  of  v  with  respect  to  the  corresponding 
vertex  in  Correlator  1.  The  retimed  weight  of  an  edge  u  -2*  v  is  given  by  u>r(e)  =*  u/(e)-f  r(v)  — r(tt). 

As  an  example,  Figure  5  shows  the  circuit  graph  of  Correlator  3,  which  can  be  obtained 
from  Correlator  1  by  applying  the  Bellman-Ford  algorithm  to  the  inequalities  from  Theorem  7 
with  clock  period  e  =  13.  There  are  11  inequalities  (one  for  each  edge)  that  must  be  satisfied 
to  ensure  a  legal  retiming — Item  7.1  in  Theorem  7.  Of  the  potential  34  inequalities  arising 
from  cases  where  D(u,v)  >  13 — Item  7.2  in  the  theorem — only  five  need  be  included  if  we 
eliminate  those  for  which  either  D(u,v)  —  d( v)  >  13  or  D(u,v)  —  d(u)  >  13.  In  the  table  for 
D  in  Figure  4,  those  entries  corresponding  to  the  five  relevant  inequalities  are  circled. 


5.  A  more  efficient  algorithm  for  clock  period  minimization 

In  this  section  we  describe  an  asymptotically  more  efficient  algorithm  for  the  clock  period 
minimization  problem.  Specifically,  we  will  show  that  the  feasible  clock  period  test  in  Step  3 
of  Algorithm  OPT1,  which  determines  whether  there  exists  a  retiming  of  G  with  clock  period 
at  most  c,  can  be  performed  in  O  (|V|  |E|)  time,  a  significant  improvement  over  o(|V|s) 
for  sparse  graphs.  This  result  yields  an  0(|V(  jSjlg  \V\)  time  algorithm  for  determining  the 
optimal  retiming. 

We  begin  with  the  0(\V\  |E|)-time  algorithm  for  determining  whether  a  given  clock  period 
is  feasible. 

Algorithm  FEAS  (Feasible  clock  period  test).  Given  a  synchronous  circuit  G  =  (' V,E,d,w ) 
and  a  desired  clock  period  c ,  this  algorithm  produces  a  retiming  r  of  G  such  that  G,  is  a 
synchronous  circuit  with  clock  period  $(G>)  <  c,  if  such  a  retiming  exists. 

1.  For  each  vertex  v  €  V,  set  r(v)  «—  0. 

2.  Repeat  the  following  |V|  —  1  times: 

2.1.  Compute  graph  G,  with  the  existing  values  for  r. 

2.2.  Run  Algorithm  CP  on  the  graph  Gr  to  determine  A(v)  for  each  vertex  v  €  V. 

2.3.  For  each  v  such  that  A(v)  >  c,  set  r(v)  *—  r(v)  +  1. 

3.  Run  Algorithm  CP  on  the  circuit  G,.  If  we  have  $(Gr)  >  c,  then  no  feasible  retiming 
exists.  Otherwise,  r  is  the  desired  retiming.  | 

Proof  of  correctness .*  Algorithm  FEAS  works  by  relaxation.  Step  I  specifies  an  initial 
tentative  retiming  in  which  each  vertex  has  zero  lag  (so  that  G,  =  G).  Each  iteration  of  Step  2 
is  equivalent  to  one  pass  of  a  Bellman-Ford  algorithm  on  the  constraints  in  Theorem  7.  We 
assume  that  the  tentative  values  produced  during  each  pass  over  the  constraint  set  depend 
only  on  the  tentative  values  from  the  previous  pass. 

After  each  iteration  of  Step  2,  the  tentative  retiming  is  guaranteed  to  be  legal.  Consider 
an  edge  u  v  in  Gr.  If  the  retimed  weight  tur(e)  is  strictly  positive  at  the  beginning  of  the 
iteration,  then  it  will  be  nonnegative  at  the  end  of  the  iteration  because  r(u)  can  increase  by 
at  most  I  and  r(v)  cannot  decrease.  If  wr(e)  =  0  at  the  beginning  of  the  iteration  and  if  r(u) 
is  incremented,  then  r(v)  will  be  incremented  well  because  A(v)  >  A(u)  +  d(u)  >  A(u)  >  c 
in  this  case. 

It  remains  to  show  that  Step  2  simulates  a  pass  of  a  Bellman-Ford  algorithm  on  the 
constraints  from  Theorem  7.  Since  the  tentative  retiming  is  always  legal  at  the  beginning 
of  an  iteration,  the  constraints  7.1  are  already  satisfied.  Thus,  the  relaxation  step  in  the 
inner  loop  of  the  Bellman-Ford  algorithm  does  not  change  the  value  of  any  r(v)  for  these 
constraints. 

To  see  that  the  effects  of  the  relaxations  due  to  the  constraints  7.2  are  achieved,  consider 
any  two  vertices  u,v  €  V.  If  we  have  Z?(u,  v)  <  e,  then  no  inequality  for  critical  paths 
from  u  to  v  occurs  in  this  constraint  set.  If  the  retimed  critical  path  weight  W,(u,v)  = 
W( u,t>)  +  r(v)  -  r(u)  is  positive,  then  the  corresponding  inequality  in  the  constraint  set  is 
already  satisfied.  Finally,  when  D{ utv)  >  c  and  W{u,v)  =  0,  there  is  some  path  u  A*  v 


5  A  more  detailed  proof  can  be  found  in  [17]. 


such  that  tvr(p)  =  Wr(u,v)  =  0  and  d(p)  =  D(u,v).  The  existence  of  this  path  implies 
that  A(v)  >  d(p)  —  D(u,v)  >  e,  so  that  r(v)  will  be  given  the  new  value  r(t/)  +  1  = 
r(u)  +  Wr  (u,  v)  —  W{u,  v)  + 1  =  r(u)  —  W (a,  v)  + 1,  precisely  achieving  the  effect  of  the  desired 
relaxation  of  the  constraint  r(u)  —  r(t/)  <  W( u,t/)  -  1.  Conversely,  r(v)  is  incremented  only 
when  there  exists  some  path  x  v  such  that  wr(p)  =  0  and  d(p)  =  A(v)  >  e,  implying  that 
D(x,v)  >  d(p)  >  c  and  Wr(u,v)  =  0.  | 

Using  Algorithm  FEAS  in  Step  3  of  Algorithm  OPTl  yields  the  following  improved  algo¬ 
rithm  for  clock  period  minimization. 

Algorithm  OPT2  ( Clock  period  minimization).  Given  a  synchronous  circuit  G  =  (V,  E,  d,  w), 
this  algorithm  determines  a  retiming  r  such  that  $(Gr)  is  as  small  as  possible. 

1.  Compute  W  and  D  using  Algorithm  WD. 

2.  Sort  the  elements  in  the  range  of  D. 

3.  Binary  search  among  the  elements  D(u,  v)  for  the  minimum  achievable  clock  period. 
To  test  whether  each  potential  clock  period  e  is  feasible,  apply  Algorithm  FEAS. 

4.  For  the  minimum  achievable  clock  period  found  in  Step  3,  use  the  values  for  the  r(v) 
found  by  Algorithm  FEAS  as  the  optimal  retiming.  | 

Algorithm  OPT2  runs  in  0(|V|  |E|  lg  (VI)  time. 

6.  A  mathematical  programming  framework  for  retiming 

In  this  section  we  describe  another  algorithm  for  clock  period  minimization  based  on  a  special 
case  of  mixed-integer  linear  programming.  Specifically,  we  will  show  that  the  feasible  clock 
period  test,  which  determines  whether  there  exists  a  retiming  of  G  with  clock  period  at  most 
c,  can  be  performed  in  O  (|V|  |£|  Ig  | V J)  time.  Although  this  bound  is  not  an  improvement 
over  the  O  (|V|  \E\)  bound  for  Algorithm  FEAS,  the  mathematical  programming  framework 
in  this  section  provides  further  insight  into  retiming. 

The  feasible  clock  period  test  can  be  reduced  to  the  following  mixed-integer  programming 
problem. 

Problem  MILP.  Let  S  be  a  set  of  m  linear  inequalities  of  the  form  x}  —  x<  <  a,; 
on  the  unknowns  xlf  x2, . . . ,  xn,  where  the  a,,  are  given  real  constants,  and  let  k  be 
given.  Determine  feasible  values  for  the  unknowns  x,  subject  to  the  constraint  that 
x i  is  integer  for  i  =  1, 2, . . . ,  k  and  real  for  i  =  k  +  1,  k  +  2, . . . ,  n,  or  determine 
that  no  such  values  exist. 

Although  mixed-integer  programming  is  in  general  NP-complete  (because  integer  pro¬ 
gramming  is  (3,  p.  245]),  this  special  case  cm  be  solved  in  0(mn  +  fcmlgn)  time  (13).  The 
reduction  of  the  feasible  clock  period  test  to  Problem  MILP  makes  use  of  the  following  lemma. 

Lemma  8.  Let  G  =  {V,  E,  d,  w)  be  a  synchronous  circuit,  and  let  c  be  a  positive 
real  number.  Then  the  clock  period  $(G)  is  less  than  or  equal  to  c  if  and  only  if 
there  exists  a  function  s  :  V  — ♦  [0,c]  such  that  s(u)  >  d(v)  for  every  vertex  v  and 
such  that  «(v)  >  s(u)  +  d(v)  for  every  zero-weight  edge  u  A 


v. 


Proof.  For  each  vertex  v,  let  A(v)  be  the  maximal  sum  of  the  combinational  delays  along 
any  zero-weight  path  that  ends  at  v.  (This  A  is  the  same  as  the  one  in  Algorithm  CP.)  5y 
definition,  we  have  4(G)  <  e  if  and  only  if  A(v)  <  e  for  all  v.  If  we  have  4(G)  <  c,  the 
function  A  satisfies  the  desired  properties  for  s.  Conversely,  if  a  function  s  exists  having  the 
desired  properties,  then  we  have  A(v)  <  s(t>)  <  c  for  every  vertex  v.  | 

Lemma  8  and  Corollary  3  together  give  a  characterization  of  when  it  is  possible  to  retime 
a  circuit  so  that  the  retimed  circuit  has  a  clock  period  of  e  or  less. 

Lemma  0.  Let  G  =  (V,  E,  d,  w)  be  a  synchronous  circuit,  and  let  c  be  a  positive 
real  number.  Then  there  exists  a  retiming  r  of  G  such  that  4(G,)  <  c  if  and  only 
if  there  exists  an  assignment  of  a  real  value  s(v)  and  an  integer  value  r(y)  to  each 
vertex  v  €  V  such  that  the  following  conditions  are  satisfied: 

9.1.  —  «(v)  <  —d(v)  for  every  vertex  v  €  V, 

9.2.  s(v)  <  c  for  every  vertex  v  €  V, 

9.3.  r(u)  —  r(v)  <  w(e)  wherever  bAb,  and 

9.4.  «(tt)  —  s(v)  <  —d(v)  wherever  u  A  v  such  that  r(u)  —  r(v)  =  tv(e). 

Proof.  Condition  9.3  captures  the  requirements  for  r  to  be  a  legal  retiming,  as  given  in 
Corollary  3,  namely  that  r  be  a  mapping  from  V  to  Z  such  that  Gr  satisfies  Condition  Wl. 
Conditions  9.1,  9.2,  and  9.4  capture  the  requirement  for  Gr  to  have  a  clock  period  4(Gr)  <  c 
as  given  in  Lemma  8.  (Recall  that  G ,  is  defined  to  have  wr{e)  =  te(e)  +  r(v)  —  r(u)  for  each 
edge  u  v.)  | 

Unfortunately,  this  result  does  not  quite  allow  us  to  recast  an  instance  of  the  feasible 
clock  period  test  as  an  instance  of  Problem  MILP  because  of  the  qualifying  clause  “such  that 
r(u)  —  r(t>)  =  to(e)B  in  Condition  9.4.  The  next  theorem  shows  that  the  conditions  can  be 
expressed  without  such  a  clause. 

Theorem  10.  Let  G  =  ( V ,  E,  d,  w)  be  a  synchronous  circuit,  and  let  c  be  a  positive 
real  number.  Then  there  is  a  retiming  r  of  G  such  that  4(Gr)  <  c  if  and  only  if 
there  exists  an  assignment  of  a  real  value  iZ(v)  and  an  integer  value  r(v)  to  each 
vertex  v  €  V  such  that  the  following  conditions  are  satisfied: 

10.1.  r(t>)  —  R(v)  <  -d(v)/c  for  every  vertex  v  6  V, 

10.2.  i2(v)  —  r(v)  <  1  for  every  vertex  v  €  V, 

10.3.  r(u)  —  r(v)  <  w(e)  wherever  u  A  v,  and 

10.4.  jR(u)  —  R(v)  <  u»(e)  -  d{v)/c  wherever  u  A  v. 

Proof.  Any  solution  to  the  conditions  in  Lemma  9  can  be  converted  to  a  solution  to  the 
conditions  above  by  using  the  same  values  for  the  r(v)  and  taking  J2(v)  =  r(v)  +  $(v)/c 
for  each  vertex  v.  Conversely,  any  solution  to  the  conditions  above  yields  a  solution  to  the 
conditions  in  Lemma  9  using  the  substitution  s(v)  =  c  (R(v)  —  r(v)).  | 

Theorem  10  is  the  basis  for  the  following  improvement  on  Algorithm  OPTl. 

Algorithm  OPT3  ( Clock  period  minimization).  Given  a  synchronous  circuit  G  =  (V,  E,  d,  w) , 
this  algorithm  determines  a  retiming  r  such  that  4(Gr)  is  as  small  as  possible. 
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1.  Compute  W  and  D  using  Algorithm  WD. 

2.  Sort  the  elements  in  the  range  of  D. 

3.  Binary  search  among  the  elements  D(u,  v)  for  the  minimum  achievable  clock  period. 
To  test  whether  each  potential  clock  period  e  is  feasible,  solve  Problem  MILP  to 
determine  whether  the  conditions  in  Theorem  10  can  be  satisfied. 

4.  For  the  minimum  achievable  clock  period  found  in  Step  3,  use  the  values  for  r  found 
by  the  algorithm  that  solves  Problem  MILP  as  the  optimal  retiming.  | 

This  algorithm  can  be  made  to  run  in  0(\V\  |£|  lg  |V|  +  |V j*  lgJ  jV|)  time  by  choosing  effi¬ 
cient  algorithms  for  each  of  the  steps.  If  Johnson’s  all-pairs  shortest-patns  algorithm  [5]  using 
the  Fibonacci  heap  data  structure  due  to  Fredman  and  Tar  j  an  [l]  is  used  in  Algorithm  WD, 
Step  1  runs  in  0(| V\  \E\  +  |V|*  lg  |V|)  time.  Since  there  are  only  0(|V|J)  elements  in  the  range 
of  D,  Step  2  runs  in  0(|Vr|,lg|Vr|)  time.  Each  iteration  of  the  binary  search  in  Step  3  re¬ 
quires  solving  an  instance  of  Problem  MILP  with  \V\  integer  variables,  IV)  real  variables,  and 
2  |V|+2  |E|  =  0(E)  inequalities.  Thus  the  total  time  for  Step  3  is  0(|V|  |2£|  lg  |V'|+|V|l  lga  |V|) 
if  the  algorithm  from  [13]  is  used.  The  optimal  retiming  from  Step  4  is  produced  as  a  side 
effect  of  Step  3. 

7.  Unit  propagation  delay,  systolic  circuits,  and  slowdown 

This  section  examines  circuits  in  which  the  propagation  delays  of  all  functional  elements 
are  equal.  For  such  circuits,  the  clock  period  minimization  problem  can  be  solved  more 
simply  than  for  arbitrary  circuits.  In  this  section,  we  explore  the  relation  of  this  class  of 
circuits  to  systolic  computation  [6],  [7],  [9],  [12].  We  observe  that  many  systolic  circuits  in  the 
literature  can  support  several  independent,  interleaved  computations.  In  [12],  we  introduced 
a  transformation  called  slowdown  which,  when  coupled  with  retiming,  can  be  used  to  produce 
a  systolic  circuit  from  an  arbitrary  synchronous  circuit.  In  this  section,  we  give  an  efficient 
algorithm  for  determining  whether  a  given  circuit  can  be  produced  from  another  circuit  by 
slowdown  and  retiming.  If  so,  it  determines  that  other  circuit. 

We  define  a  circuit  G  =  (V,  E,  d,  iv)  to  be  a  unit-delay  circuit  if  each  vertex  v  6  V  has 
propagation  delay  d(v)  =  1.  The  next  theorem  gives  a  characterization  of  when  a  unit-delay 
circuit  has  clock  period  less  than  or  equal  to  c.  The  theorem  is  phrased  in  terms  of  the  graph 
G  —  1/c,  which  is  defined  as  G  —  1/c  =  {V,  E,  d,  to')  where  to'(e)  =  u>(e)  —  1/e  for  every  edge 
t  €.  E.  Thus  G  —  1/e  is  the  graph  obtained  from  G  by  subtracting  1/e  from  the  weight  of 
each  edge  in  G. 

Theorem  11.  Let  G  =  ( V,  E,  d,  w )  be  a  unit-delay  synchronous  circuit,  and  let 
c  be  any  positive  integer.  Then  there  is  a  retiming  r  of  G  such  that  $(G>)  <  c  if 
and  only  if  G  —  1/c  contains  no  cycles  having  negative  edge  weight. 

Proof.  First,  suppose  G  —  1/c  has  no  negative-weight  cycles.  We  shall  produce  a  retiming  r 
of  G  such  that  $(G>)  <  c.  Assume  without  loss  of  generality  that  there  is  a  path  from  each 
vertex  v  of  G  to  some  vertex  vq  (if  not,  add  edges  of  the  form  v  — *  vq  with  sufficiently  large 


weight  so  that  no  negative- weight  cycles  are  introduced  into  G  —  1/c),  and  let  y(v)  be  the 
weight  of  the  shortest  path  from  v  to  v0  in  C?  —  1/c.  For  each  vertex  v,  let  r(v)  =  fp(v)]. 

We  now  prove  that  the  r(v)  so  defined  are  a  legal  retiming  and  that  $(G>)  <  e.  First 
we  show  legality  by  showing  wr(e)  =  to(e)  +  r[v)  —  r(u)  >  0  for  every  edge  u  A  t>.  Since 
the  shortest  path  from  u  to  vo  is  at  least  as  short  as  the  path  from  u  that  takes  e  and 
then  the  shortest  path  from  v  to  vq,  we  have  g(u)  <  g(v)  +  w(e)  —  1/c.  Taking  ceilings 
of  both  sides  gives  r(u)  <  fp(v)  +w(e)  -  1/c]  <  fp(t;)]  +  tt/(e)  <  r(v)  +  tu(e),  and  thus 
wr(e)  =  w(e)  +  r(t>)  —  r(«)  >  0,  as  desired. 

Next,  we  must  show  that  the  clock  period  of  the  retimed  circuit  Gr  is  at  most  c.  That 
is,  we  must  show  that  wr(p)  >  1  for  any  path  u  v  containing  c  or  more  edges.  The 
shortest  path  from  u  to  uo  is  at  least  as  short  as  the  path  from  u  that  takes  p  and  then 
the  shortest  path  from  v  to  vQ,  and  the  total  weight  along  p  in  G  —  1/c  is  at  most  u/(p)  —  1 
since  there  are  at  least  e  edges  in  the  path.  Thus  we  have  g(u)  <  p(v)  +  w(p)  —  1,  and 
tur(p)  =  w(p)  +  \g{v)]  -  [y(u)l  >  1. 

On  the  other  hand,  suppose  G  -  1/c  contains  some  cycle  p  with  negative  weight.  We  must 
prove  that  G  cannot  be  retimed  to  have  a  clock  period  of  c  or  less.  Let  n  be  the  number  of 
edges  in  the  cycle  p.  By  the  definition  of  G  —  1/c,  we  have  tu(p)  —  n/c  =  tu'(p),  where  w' 
is  the  edge-weight  function  for  G  —  1/c.  But  by  supposition,  tu'(p)  is  negative,  which  means 
that  w(p)  —  n/c  <  0,  that  is,  the  cycle  p  contains  fewer  than  n/c  registers  in  G.  But  retiming 
leaves  the  number  of  of  registers  on  any  cycle  unchanged  (Corollary  2).  Thus,  no  matter 
how  the  fewer  than  n/c  registers  are  distributed  on  the  cycle  of  n  vertices,  there  can  be  no 
registers  on  some  string  of  more  than  c  vertices.  Consequently,  G  cannot  be  retimed  to  have 
a  clock  period  of  c  or  less.  | 

To  test  whether  there  is  a  retiming  r  of  a  unit-delay  circuit  G  such  that  $(G>)  <  c,  we 
can  use  the  Bellman-Ford  algorithm  to  find  the  weight  p(v)  of  the  shortest  path  in  G  —  1/c 
from  each  vertex  v  to  an  arbitrary  vertex  v<>.  If  the  shortest-path  weights  are  not  well  defined, 
the  Bellman-Ford  algorithm  detects  a  negative-weight  cycle,  which  means  that  no  retiming 
exists.  Thus,  the  feasible  clock  period  test  can  be  performed  in  0(|V|  |£|)  time  for  unit-delay 
circuits  using  the  Bellman-Ford  algorithm  directly. 

A  systolic  circuit  is  a  unit-delay  circuit  in  which  there  is  at  least  one  register  along  every 
interconnection  between  two  functional  elements.  Thus  the  clock  period  of  a  systolic  circuit  is 
the  minimum  possible — the  propagation  delay  through  a  single  functional  element.  Systolic 
circuits  have  been  studied  extensively  ([6],  [7],  [9],  [12]),  and  they  have  many  applications 
including  signal  processing,  matrix  manipulation,  machine  vision,  and  raster  graphics. 

Interpreted  in  the  context  of  systolic  circuits,  Theorem  11  is  a  generalization  of  the  Sys¬ 
tolic  Conversion  Theorem  from  [12],  which  says  that  G  can  be  retimed  to  be  systolic  if  the 
constraint  graph  G  -  1  has  no  cycles  of  negative  weight.  (Simply  restrict  Theorem  11  to  the 
case  where  c  =  1.)  The  Systolic  Conversion  Theorem  is  generalized  in  a  different  way  in  [12], 
however,  through  the  idea  of  slowdown. 

For  any  circuit  G  =  (V,  E,  d,  w)  and  any  positive  integer  c,  the  circuit  cG  is  the  circuit 
obtained  by  multiplying  all  the  register  counts  in  G  by  c.  That  is,  the  circuit  cG  is  defined 
as  cG  =  (V,E,d,w')  where  tu'(e)  =  cw{c)  for  every  edge  c  6  E.  All  the  data  flow  in  cG  is 
slowed  down  by  a  factor  of  c,  so  that  cG  performs  the  same  computations  as  G,  but  takes  c 
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times  as  many  clock  ticks  and  communicates  with  the  external  interfaces  only  on  every  cth 
clock  tick.  In  fact,  cG  acts  as  a  set  of  e  independent,  interleaved  versions  of  G. 

We  refer  to  any  circuit  obtained  by  retiming  cG  as  a  c-alow  version  of  G.  The  main 
advantage  of  a  c-slow  circuit  cG  is  that  it  can  often  be  retimed  to  have  a  shorter  clock  period 
than  any  retimed  version  of  the  original.  For  some  applications,  throughput  is  the  issue,  and 
multiple,  interleaved  streams  of  computation  can  be  effectively  utilized.  A  c-slow  circuit  that 
is  systolic  offers  maximum  throughput. 

The  following  corollary  to  Theorem  11  tells  when  a  circuit  has  a  c-slow  version  which  is 
systolic. 

Corollary  12.  Let  G  =  (V,E,d,w)  be  a  unit-delay  synchronous  circuit,  and  let 
c  be  an  arbitrary  positive  integer.  Then  the  following  are  equivalent: 

12.1.  The  graph  G  —  1/c  has  no  negative-weight  cycles. 

12.2.  The  circuit  G  can  be  retimed  to  have  clock  period  less  than  or  equal  to  c. 

12.3.  The  c-slow  circuit  cG  can  be  retimed  to  be  systolic. 

Proof .  That  12.1  and  12.2  are  equivalent  is  exactly  Theorem  11.  The  equivalence  of  12.1  and 
12.3  follows  by  applying  Theorem  11  to  cG  with  clock  period  1,  and  observing  that  cG  —  1 
has  a  negative-weight  cycle  if  and  only  if  G  —  1/c  has  a  negative-weight  cycle.  | 

The  registers  of  a  c-slow  circuit  cG  are  naturally  divided  into  c  equivalence  classes.  Given 
any  two  registers  A  and  B  in  cG,  the  number  of  registers  on  any  two  paths  from  register  A 
to  register  B  are  congruent  modulo  c.  Moreover,  if  we  consider  undirected  paths,  in  which 
edges  can  be  traversed  in  the  reverse  direction,  and  if  we  generalize  the  notion  of  path  weight 
by  adding  1  for  each  register  on  a  forward  edge  and  subtracting  1  for  each  register  on  a 
reverse  edge,  the  register  counts  of  two  undirected  paths  from  register  A  to  register  B  are 
also  congruent  modulo  c.  Consequently,  the  registers  are  naturally  divided  into  equivalence 
classes  according  to  their  undirected  path  weight  (modulo  c)  from  an  arbitrary  vertex. 

At  any  given  time  step,  any  two  registers  in  different  equivalence  classes  contain  data 
from  independent  streams  of  computation— data  that  can  never  arrive  at  inputs  of  the  same 
functional  element  at  the  same  time.  Although  retiming  destroys  the  individual  identities  of 
the  registers,  Lemma  1  guarantees  that  the  registers  of  any  retimed  c-slow  circuit  can  still  be 
partitioned  into  c  such  equivalence  classes. 

Using  the  notion  of  equivalence  classes  of  registers,  the  following  scenario  illustrates  the 
relationships  given  in  Corollary  12.  Let  G  =  (V,  E,  d,  w)  be  a  unit-delay  synchronous  circuit. 
Find  an  integer  c  such  that  G  —  1/e  has  no  negative-weight  cycles,  and  consider  the  circuit 
cG.  There  is  a  retiming  r  of  cG  such  that  (cG)r  is  systolic.  If  we  remove  all  the  registers 
in  the  c-slow  circuit  ( cG)r  except  for  those  in  one  equivalence  class,  the  resulting  circuit  is  a 
retimed  version  of  the  original  circuit  G  whose  clock  period  is  less  than  or  equal  to  c. 

Many  systolic  circuits  appearing  in  the  literature  are  fundamentally  2-slow  or  3-slow— even 
if  the  ideas  of  slowdown  and  retiming  were  not  explicitly  used  in  their  design.  For  example,  the 
systolic  algorithms  for  band-matrix  multiplication  and  LU -decomposition  from  [7]  are  3-slow 
and  can  support  three  independent,  interleaved  streams  of  computation.  If  all  independent 
streams  of  computation  cannot  be  utilized  in  a  c-slow  circuit,  it  may  be  desirable  to  remove 


all  registers  except  for  those  in  one  equivalence  class.  The  following  algorithm  determ.aes 
a  circuit  is  actually  a  c-slow  version  of  another,  and  if  so,  produces  that  reduced  c'. cult. 

Algorithm  R  ( Remove  all  but  one  equivalence  class  of  registers  in  a  circuit ).  Giv:  <?. 
synchronous  circuit  G  —  ( V,E,d,w )  this  algorithm  determines  the  largest  e  and  produces  a 
reduced  circuit  G1  such  that  G  is  a  c-slow  version  of  G\ 

1.  For  each  vertex  v  €  V,  set  dist(v)  to  the  weight  of  some  undirected  path  from  v  to  an 
arbitrary  vertex  v0  €  V. 

2.  Compute  c  =  ged  jty(e)  +  dist(y)  —  dist(u)  :  u  A 

3.  For  each  vertex  v  €  V,  set  r(t>)  =  dist(v)  mod  c. 

4.  Produce  G'  =  {V,F?,d,ty'),  where  tu'(e)  =  (u/(e)  +  r(v)  —  r(u))  jc  for  each  edge  u  t\ 
Proo/  o/  correctness.  We  first  show  that  for  each  edge  u  A  v,  the  value  to'(e)  is  a  legal  register 
count.  By  construction,  c  evenly  divides  tw(e)  +  r(v)  —  r(u)  because  c  divides  w(e)  +  rf»st(v)  - 
dtst(u).6  Thus,  for  any  edge,  the  register  count  w'(e)  produced  in  Step  4  is  guaranteed  to  be 
an  integer.  In  addition,  to'(c)  is  guaranteed  to  be  strictly  greater  than  —1  because  r(u)  must 
be  less  than  c  and  to (c)  +  r(v)  is  at  least  0.  Since  we  have  just  shown  that  w'(e)  is  an  integer, 
it  must  be  nonnegative. 

The  construction  in  Step  4  directly  provides  the  identity  G'  —  Grj e,  and  thus  G  is  a  c-slow 
version  of  G'.  We  now  show  that  the  c  computed  in  Step  2  is  the  largest  possible.  Suppose 
there  is  a  c'  such  that  G  is  a  c'-slow  version  of  another  circuit  G'.  We  wish  to  show  that  c' 
divides  u>(e)  +  dist[y)  —  diet  (u)  for  each  edge  u  A  v,  and  thus  that  c'  divides  c.  For  every 
vertex  v  the  weights  of  all  undirected  paths  in  c'G1  from  v  to  v0  are  congruent  mod  lo  c'. 
Since  retiming  changes  all  path  weights  between  two  vertices  by  the  same  amount  (which 
is  provable  for  undirected  paths  by  generalizing  Lemma  1),  it  must  be  the  case  that  in  G, 
the  weights  of  all  undirected  paths  from  v  to  t>0  are  congruent  modulo  c\  In  particular,  the 
weight  of  the  path  u  —**  v0  that  determines  dist(u)  and  the  weight  w(e)  +  dist(v)  of  the  path 
u  -A  v  — «■  v0  must  be  congruent  modulo  c'.  Hence  c'  divides  u/(e)  +  dist(v )  —  uist(u).  | 

Step  1  of  Algorithm  R  can  be  performed  in  time  OflV|  +  \E\)  =  0(jP|)  by  depth-first 
search.  Step  3  runs  in  OflVj)  time,  and  Step  4  takes  G(|J5|)  time.  Step  2  takes  more  work, 
but  not  much  more.  The  computation  of  the  greatest  common  divisor  of  |£|  integers  can  be 
performed  in  0(|P|  +  lgx)  time,  where  z  the  least  nonzero  absolute  value  of  any  of  the 
numbers.  Just  start  with  this  value  x — which  can  be  found  in  0(|JE|)  time — as  a  tentative 
ged,  and  ged  in  each  of  the  other  numbers  in  any  order.  Each  mod  operation  in  Euclid’s 
algorithm  either  uses  up  one  of  the  |2?|  numbers,  or  else  divides  the  current  tentative  ged  by 
the  golden  ratio  (1  +  >/&)/ 2.7  (As  a  practical  matter,  starting  with  any  of  the  |£|  numbers 
as  the  initial  tentative  ged  would  give  reasonable  performance,  since  the  number  of  registers 
in  a  typical  circuit  is  much  less  than  exponential  in  the  number  of  edges.)  Thus  the  total 
running  time  of  Algorithm  R  is  0(|E|  +  lg  z). 

6  If  w(e)  +  dift(v)  —  tfut(u)  =  0  for  every  edge  u  A  v,  then  in  Step  2  we  get  c  =  oo.  Using  the  standard 
convention  that  z  mod  oo  —  x  and  z/oo  —  0,  Algorithm  R  yields  a  reduced  graph  G'  in  which  each  edge  u  A  v 
has  weight  tt/(e)  =  0. 

7That  is,  if  n  mod  operations  are  performed  to  compute  a  new  tentative  ged,  then  it  will  be  smaller  than 
the  old  tentative  ged  by  at  least  a  factor  of  ((1  +  y/S)/2)m~l. 


Observe  that  Algorithm  R  works  not  only  for  unit-delay  circuits,  but  for  any  synchronous 
circuit.  Furthermore,  when  the  extra  equivalence  classes  of  registers  from  a  c-slow  circuit  are 
removed,  the  clock  period  of  the  reduced  circuit  is  not  unduly  lengthened.  The  definition  of 
w1  in  Step  4  provides  that  for  any  path  p  of  weight  w(p)  >  e  in  G,  we  have  u»'(p)  >  0,  which 
implies  that  $(G')  <  cft(Gr).  To  guarantee  the  minimum  clock  period,  however,  the  reduced 
circuit  must  generally  be  retimed. 

A  systolic  circuit  that  is  naturally  c-slow  can  be  converted  by  Algorithm  R  into  a  circuit 
that  performs  an  operation  on  every  clock  tick  and  whose  clock  period  is  bounded  by  c. 
This  conversion  can  result  in  a  performance  advantage  because  in  practice,  there  are  time 
penalties  associated  with  the  loading  of  registers.  Because  of  this  overhead,  c  clock  ticks  of 
a  circuit  with  nominal  period  1  typically  use  more  time  than  one  clock  tick  of  a  circuit  with 
nominal  period  c.  Also,  a  reduction  in  registers  may  save  chip  area,  which  can  lead  to  further 
performance  improvements  since  the  wires  will  in  general  be  shorter.  A  possible  disadvantage 
of  reducing  the  number  of  equivalence  classes  of  registers  is  that  throughput  is  also  reduced 
in  cases  where  the  independent  streams  of  computation  might  be  effectively  utilized. 

8.  Register  minimization  and  fanout 

Thus  far,  we  have  concentrated  on  clock  period  as  the  objective  function  for  determining  a 
retiming.  In  Section  7,  however,  we  showed  that  the  number  of  registers  in  a  circuit  could 
sometimes  be  reduced  by  a  method  other  than  retiming.  This  section  shows  the  problem 
of  retiming  a  circuit  to  minimize  the  total  state  of  a  circuit  is  polynomial-time  solvable  by 
reducing  it  to  a  minimum-cost  flow  [8,  p.  129]  problem.  We  also  show  that  the  total  state 
of  a  circuit  can  be  minimized  subject  to  a  bound  on  the  clock  period.  These  results  can  be 
extended  to  reflect  the  widths  of  the  interconnections  and  ways  by  which  fanout  is  modeled. 

For  a  given  circuit  G  =  ( V,E,d,w ),  the  state  minimization  problem  is  to  determine  a 
retiming  r  such  that  the  total  state  S(G>)  =  wr(e)  of  the  retimed  circuit  is  minimized. 
By  the  definition  of  wr,  we  have 

S(Gr )  =  Jwr(<) 

«€« 

=  £  (w(e)  +  r(v)  -  r(u)) 

=  S(G)  +  ^  r(v)  (indegree(v)  —  outdegree(v))  . 

*€V 

Since  S{G)  is  constant,  minimizing  Sr(G)  is  equivalent  to  minimizing  the  quantity 

^2  r(v)  (indegree(v)  -  outdegree(v))  ,  (5) 

•€V 

which  is  a  linear  combination  of  the  r(v)  since  (indegree (v)  -  outdegree(v))  is  constant  for 
each  v.  The  minimization  is  subject  to  the  constraint  that  for  each  edge  uAn,  the  register 
count  u/f(e)  is  nonnegative — that  is, 

r(«)  —  r(v)  <  «>(«)  . 
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The  state  minimization  problem,  cast  in  this  manner,  is  the  linear-programming  dual  of 
a  minimum-cost  flow  problem  and  can  be  solved  in  polynomial  time  by  an  algorithm  due  to 
Edmonds  and  Karp  [8,  p.  131].  (As  a  practical  matter,  this  problem  can  be  efficiently  solved 
by  the  primal  simplex  technique  [4,  p.  186].)  Furthermore,  since  the  constants  tv(e)  in  the 
problem  are  all  integers,  there  is  an  integer  optimal  solution  to  the  r(v). 

More  complicated  problems  can  be  solved  within  the  same  framework.  For  example,  the 
total  state  of  a  circuit  can  be  minimized  subject  to  a  bound  on  the  clock  period.  Given  a 
maximum  allowable  clock  period  e,  we  wish  to  find  a  retiming  r  that  minimizes  the  state 
S(Gr)  of  the  retimed  circuit  subject  to  the  condition  that  $(Gr)  <  c.  In  this  case,  we  must 
minimize  the  quantity  (5)  subject  to  the  constraints  from  Theorem  7,  which  require  that 
r(u)  -  r(v)  <  w(e)  wherever  u  A  v,  and  that  r(u)  -  r(v)  <  W(ut  v)  -  1  wherever  D(u,  v)  >  c. 
The  state  minimization  problem  remains  the  dual  of  a  minimum-cost  flow  problem,  but  the 
minimization  is  subject  to  more  constraints. 

The  state  minimization  problem  can  be  generalized  by  allowing  registers  on  different  edges 
to  have  different  costs.  For  example,  it  may  be  cheaper  to  add  a  register  along  a  one-bit  wide 
control  path  than  along  a  32-bit  wide  data  path.  We  may  model  such  situations  by  assigning 
to  each  edge  e  a  breadth  /3(e)  proportional  to  the  cost  of  adding  a  register  along  e.  The 
objective  function  which  we  must  minimize  is  then  given  by 

£  (£  /*(«)  “  £  /?(«))  >  (6) 

and  the  constraints  on  the  r(v)  are  unchanged.  This  problem  is  still  the  dual  of  a  minimum- 
cost  flow  problem  since  the  quantity  in  the  large  parentheses  is  a  constant  for  each  v.  Although 
the  (3(e)  need  not  be  integers,  if  there  is  a  solution  to  the  state  minimization  problem,  there 
is  an  integer  optimal  solution  because  the  linear  programming  tableau  for  the  problem  is 
unimodular  and  the  right-hand  side  is  an  integer  vector.  If  the  (3(e)  are  arbitrary  numbers,  the 
minimum-cost  flow  algorithm  of  Edmonds  and  Karp  does  not  necessarily  run  in  polynomial 
time,  but  the  algorithm  of  Galil  and  Tardos  [2]  does. 

In  a  physical  circuit,  a  signal  from  a  register  or  functional  element  may  fan  out  to  several 
functional  elements.  As  was  mentioned  in  footnote  2  in  Section  2,  we  model  this  situation 
with  several  different  edges  in  the  circuit  graph.  For  the  clock  period  minimization  problem, 
there  was  no  harm  in  modeling  fanout  in  this  manner,  but  for  the  state  minimization  problem, 
there  can  be.  The  difficulty  that  arises  in  the  state  minimization  problem  is  that  registers 
can  be  shared  along  the  physical  interconnection.  The  objective  functions  (5)  and  (6)  do  not 
take  sharing  into  account. 


r(*)«2 

Figure  6.  Modeling  two-way  fanout  with  an  extra  vertex  having  delay  tero. 

Fanout  can  be  incorporated  into  the  model  in  several  ways  that  allow  the  sharing  of 
registers  to  be  accounted  for  exactly.  We  begin  by  looking  at  the  situation  in  Figure  6(a) 
where  one  vertex  u  has  an  output  that  fans  out  to  two  vertices  v2  and  v2.  To  deal  properly 
with  this  situation  in  the  state  minimization  problem,  it  is  sufficient  to  introduce  a  dummy 
vertex  u  with  zero  propagation  delay  which  models  the  fork  of  the  interconnection,  as  is 
shown  in  Figure  6(b).  When  the  circuit  is  retimed  to  minimize  the  number  of  registers,  either 
the  edge  from  u  to  Vi  or  the  edge  from  u  to  v2  will  have  zero  register  count,  and  the  edge 
from  u  to  &  will  have  the  shared  registers.  In  Figure  6(c)  the  edge  to  Vi  ends  up  with  zero 
weight  after  retiming  so  as  to  minimize  the  total  number  of  registers. 

Large  multiway  forks  present  some  modeling  alternatives  not  encountered  in  the  two-way 
case.  If  a  physical  interconnection  is  to  be  modeled,  the  fork  can  be  decomposed  into  several 
two-way  forks.  (In  fact,  our  concern  for  modeling  the  physical  interconnection  prompted  us  to 


design  Correlator  1  with  the  x%  running  through  the  comparators  rather  than  with  multiway 
fanout  directly  from  the  external  interface.) 

For  logical  design,  however,  it  may  be  undesirable  to  model  the  physical  interconnection 
In  the  case  of  a  three-way  fork,  for  instance,  we  might  wish  to  share  the  largest  possible  numbe- 
of  registers  between  the  two  edges  with  greatest  register  counts,  regardless  of  which  two  edges 
these  end  up  being.  Modeling  a  Jfc-way  fork  for  k  >  3  by  decomposing  the  interconnection 
into  two-way  forks  will  not  work. 

A  solution  to  this  problem  of  modeling  fc-way  fanout  with  maximum  register  sharing  is 
depicted  in  Figure  7.  An  output  of  vertex  u,  having  breadth  0,  fans  out  to  t/j, . . .  ,«*  along 
edges  u  ^  «i,u  ^  Uj ,...,u  ^  v».  In  the  retimed  circuit,  the  cc3t  of  this  fanout  should, 
be  0  times  the  maximum  of  retimed  edge  weights  twr(e,).  So  that  the  register  count  cost 
function  S(Gr)  will  properly  model  the  register  sharing,  we  first  add  a  dummy  vertex  u  with 
zero  propagation  delay.  Letting  wmtlX  =  maxi<,<*  tu(e,),  we  add  edges  v<  ti  with  weights 
tn(e,)  =  wmmx  -  t !>(«*).  Finally,  we  give  all  edges  e,  and  e<  breadths  of  0/k. 


Figure  T.  A  gadget  for  modeling  the  cost  of  multiway  fanout  with  maximal  sharing  of  registers. 

The  modified  circuit  graph  accurately  models  the  sharing  of  registers  among  the  edges  e, 
involved  in  the  fanout  when  the  state  is  minimized.  For  any  retiming  r,  Lemma  1  dictates 
that  the  weights  uv(p<)  of  all  paths  p<  =  u  v{  u  will  be  identical  since  they  are  identical 


in  the  unretimed  circuit.  The  retimed  register  counts  tur(c,)  axe  constrained  by  the  rest  of  the 
circuit,  but  the  weights  <&,(£<)  will  be  as  small  as  possible  because  &  is  a  sink  in  the  graph. 
Thus  the  register  count  of  one  of  the  if  will  be  zero,  and  therefore  the  weight  of  each  path  p, 
will  be  maxi<j£fc  tor(e,).  Since  there  are  k  paths,  each  with  breadth  0/k,  the  total  cost  of  the 
paths  will  be  •  maxi^*  wr(e,)  as  desired. 

9.  A  more  general  model  for  propagation  delay 

In  this  section  we  extend  the  methods  of  Sections  5  and  6  to  deal  with  functional  elements  in 
which  the  propagation  delays  through  individual  functional  elements  are  nonuniform.  In  an 
adder,  for  example,  the  propagation  delay  from  a  low-order  input  bit  to  a  high-order  output 
bit  may  be  far  greater  than  the  propagation  delay  from  a  low-order  input  bit  to  a  low-order 
output  bit  or  from  a  high-order  input  bit  to  a  high-order  output  bit.  Thus,  the  worst-case 
propagation  delay  through  two  cascaded  adders  can  be  much  less  than  twice  the  worst-case 
propagation  delay  through  a  single  adder.  This  section  gives  a  more  general  circuit  model  to 
handle  this  commonly  occur ing  situation.  We  show  how  the  retiming  problem  in  this  model 
can  be  reduced  to  simple  mixed-integer  programming  as  in  Section  6.  We  also  give  a  more 
efficient  relaxation  algorithm  similar  to  that  in  Section  5. 


Figure  8.  A  functional  element  with  nonuniform  propagation  delay*.  The  time  at  which  output 
X  must  settle  i*  either  4  esec  after  input  A  settle*  or  11  esec  after  input  B  settle*,  whichever  is 
later. 

We  may  take  into  account  nonuniform  propagation  delays  through  functional  elements  by 
modifying  the  model  for  synchronous  circuits  given  in  Section  2,  so  that  from  each  input  to 
each  output  of  a  given  functional  element,  an  independent  propagation  delay  may  be  assigned. 
Figure  8  shows  graphically  the  “insides”  of  a  functional  element  in  this  model.  The  vertex 
contains  internal  edges  drawn  from  a  set  F  C  E  x  E,  and  each  internal  edge  belongs  to 
a  single  vertex.  Each  normal  edge  e  goes  from  an  internal  edge  /»  to  an  internal  edge  /», 
and  we  adopt  the  notation  /«  A  /*.  The  propagation  delay  function  d,  rather  than  being  a 
function  from  V  to  the  nonnegative  reals,  is  a  function  from  F  to  the  nonnegative  reals.  For 
an  internal  edge  /  in  vertex  v,  the  value  d(f)  denotes  the  propagation  delay  through  /.  A 
given  output  of  a  functional  element  v  need  not  depend  on  all  the  inputs.  In  an  adder,  for 


example,  the  values  of  the  high-order  input  bits  have  no  effect  on  the  low-order  bits  of  the 
output. 

Paths  in  the  extended  model,  rather  than  going  from  vertex  to  vertex,  go  from  an  internal 
edge  to  an  internal  edge.  For  any  path  p  =  /0  ^  ^  •  •  •  V  /*,  the  delay  is  naturally 

defined  as  d(p)  =  <f(/0)  +  <f(/i)  +  •  •  •  +  d(/*),  and  the  register  count  of  the  path  is  defined  as 
u/(p)  =  tu(e0)  +  w(et)  +  •  •  •  +  tu(e*_i).  We  use  the  notation  u-^»»to  denote  that  path  p  goes 
from  some  internal  edge  of  vertex  u  to  some  internal  edge  of  vertex  v. 

The  clock  period  9(G)  of  a  circuit  G  =  (V,  E,  d,  w)  is  the  maximum  delay  along  any  path 
of  zero  weight.  We  define  retiming,  which  affects  w  but  not  d,  exactly  as  it  is  defined  in  the 
model  of  Section  2.* 

The  following  results  show  how  the  clock  period  minimization  problem  for  a  circuit  G 
under  the  extended  model  can  be  reduced  to  an  instance  of  Problem  MXLP  having  one  integer 
variable  for  each  functional  element  of  G  and  one  real  variable  for  each  edge  of  G.  The  results, 
which  parallel  Lemma  8,  Lemma  9,  and  Theorem  10,  are  presented  without  proof. 

Lemma  13.  Let  G  =  ly,  E,  d,  tv)  be  a  synchronous  circuit  in  the  extended  model, 
and  let  c  be  a  positive  real  number.  Then  the  clock  period  $(G)  is  less  than  or 
equal  to  c  if  and  only  if  there  exists  a  function  s  :  E  — ►  [0,  c]  such  that 

13.1.  s(e)  >  d(f)  wherever  f  A?,  and 

13.2.  e(e*)  >  a(ea)  +  d(f)  wherever  ?  -^  /  -^?  and  w(ea)  >  0. 

Lemma  14.  Let  G  —  (V,  E,  d,w)  be  a  synchronous  circuit,  and  let  c  be  a  positive 
real  number.  Then  there  is  a  retiming  r  of  G  such  that  9(Gr)  <  c  if  and  only  if 
there  exists  an  assignment  of  a  real  value  s(e)  to  each  edge  e  €  E  and  an  integer 
value  r(v)  to  each  vertex  v  €  V  such  that  the  following  conditions  are  satisfied: 

14.1.  —  s(e)  <  —d(f)  wherever  f  A?, 

14.2.  s(e)  <  c  for  every  edge  e  €  E, 

14.3.  r(u)  —  r(v)  <  w(e)  wherever  and 

14.4.  s(ea)  —  s(e»)  <  —d(f)  wherever  u  ^  f  ^?,  r(«)  —  r(v)  =  w(ea),  and  f 
is  an  internal  edge  of  v. 

Theorem  15.  Let  G  =  (V,E,d,w)  be  a  synchronous  circuit  in  the  extended 
model,  and  let  c  be  a  positive  real  number.  Then  there  is  a  retiming  r  of  G  such 
that  9(Gr)  <  c  if  and  only  if  there  exists  an  assignment  of  a  real  value  R(e)  to 
each  edge  e  €  E  and  an  integer  value  r(v)  to  each  vertex  v  €  V  such  that  the 
following  conditions  are  satisfied: 

15.1.  r(v)  —  J2(e)  <  —d(f)/c  wherever  v  -^?  and  f  is  an  internal  edge  of  v, 

“The  possibility  that  the  internal  connections  between  the  inputs  and  outputs  of  a  functional  element  may 
not  be  a  complete  bipartite  graph  gives  rise  to  some  technical  differences  between  the  extended  model  and  the 
model  of  Section  2.  First,  Condition  W2  need  only  be  imposed  for  those  cycles  in  which  consecutive  edges 
are  actually  connected  by  the  internal  data  paths  in  the  vertices.  Second,  retiming  may  not  be  the  only  way 
to  adjust  register  counts  so  that  function  is  guaranteed  to  be  preserved — if  the  (undirected)  graph  of  internal 
connections  in  some  functional  element  is  not  connected,  then  the  element  can  be  broken  up  into  two  or  more 
independent  components  which  can  be  given  different  lags. 


15.2.  12(e)  —  r( v)  <  1  wherever  v 

15.3.  r  (u)  —  r(v)  <  w(e)  wherever  and 

15.4.  JZ(c«)  —  12(e»)  <  tw(e«)  —  d[f)/c  wherever  ?^  / 

Theorem  15  says  that  the  problem  of  testing  whether  a  given  clock  period  is  feasible  for 
a  circuit  G  =  (V,E,d,w)  in  the  extended  model  can  be  efficiently  reduced  to  an  instance 
of  Problem  MILP  having  k  =  |V|  integer  variables,  n  —  k  =  \E\  real  variables,  and  m  = 
2\E\  +  2  |d|  inequalities,  where  |d|  is  the  number  of  pairs  (ea,  e*)  of  edges  for  which  d(eA,e*) 
is  defined.  Thus  the  feasible  clock  period  test  can  be  performed  in  O  (|d|  (|2?|  +  \V\  lg  |JE7[)) 
time. 

We  can  reduce  the  cost  of  the  feasibility  test  to  0(|V|  |F|)  by  using  an  algorithm  similar 
to  Algorithm  FEAS.  The  dominant  cost  of  this  computation  is  due  to  |V|  executions  of  an 
algorithm  similar  to  Algorithm  CP,  each  of  which  runs  in  0(|F|)  time. 

The  clock  period  minimisation  algorithm  for  circuits  in  the  extended  model  is  similar  to 
Algorithm  OPT2  which  performs  a  binary  search  over  a  set  of  poesible  values  for  the  minim*) 
clock  period.  By  the  same  argument  used  in  Corollary  6  for  the  model  of  Section  2,  the 
optimal  clock  period  must  be  equal  to  D(utv)  for  some  pair  of  vertices  u  and  v,  where  D 
and  W  are  defined  as  in  Equation  (4),  but  with  the  semantics  of  the  notation  interpreted  in 
the  extended  model.  The  values  D(u,t>)  for  all  connected  pairs  of  vertices  u  and  v  can  be 
found  in  O  (|£|  |F|  +  |V|  |E|  lg  \E\)  time  by  an  algorithm  similar  to  Algorithm  WD.  The  key 
step  is  to  apply  Johnson’s  all-purs  shortest-paths  algorithm  [5]  to  the  edge- weighted  graph 
H  =  (E,F,  tod),  where  the  weighting  function  is  defined  by  wd[f)  =  (u/(e),  —d(f))  wherever 
?  -A  /  in  G  (that  is,  wherever  e  -4?  in  H).  Using  Fibonacci  heaps  [1]  for  the  priority  queue  in 
Johnson’s  algorithm,  a  time  bound  of  0(\E\  |F|  +  |E|*  lg  |E|)  can  be  achieved.  One  additional 
observation  is  required  to  prove  the  claimed  time  bound  of  O  (|E|  |F|  +  |V|  \E\  lg  |j?|).  The 
dominant  cost  of  computing  W  and  D  by  Johnson’s  all-pairs  shortest  paths  algorithm  is  due 
to  |£|  applications  of  Dijkstra’s  algorithm  to  find  shortest  paths  from  each  vertex  e  G  H  to 
each  other  vertex.  Since  W  and  D  are  defined  on  V  x  V  rather  than  on  E  x  Et  however,  we 
really  only  need  to  solve  |V|  problems  of  the  form:  Given  a  vertex  v  6  V,  find  for  each  t  6  E 
the  weight  of  a  shortest  path  from  v  to  e.  Each  such  problem  can  be  solved  by  using  Dijkstra’s 
algorithm  to  find  the  shortest  path  weights  from  a  set  of  vertices  in  H  (namely,  from  any 
x  €  E  such  that  ?  A  v  in  G),  rather  than  from  a  single  vertex.  Using  Fibonacci  heaps,  the 
claimed  running  time  for  computing  W  and  D  is  obtained.  The  total  cost  of  clock  period 
minimization  in  the  extended  model  is  therefore  0(|J?|  |F|  +  |V|  |£|  lg  |£|)+0(|V|  |F|  lg  |V|)  = 
0(|E||FH-|V||F|lg|V|). 

10.  Concluding  remarks 

Our  goal  has  been  to  provide  a  general  framework  for  the  precise  understanding  of  circuit 
timing.  Through  the  use  of  a  simple  graph-theoretic  model,  we  have  been  able  to  cast  a 
variety  of  circuit  timing  problems  in  purely  combinatorial  terms.  We  believe  our  approach 
to  be  robust.  Many  other  circuit  models  and  many  other  circuit  problems  can  be  handled 
within  the  basic  framework.  We  take  time  here  to  discuss  a  few. 


Pipelining.  An  important  special  case  of  clock  period  minimization  is  the  problem  of 
optimally  pipelining  combinational  circuitry.  In  a  combinational  circuit,  all  register  counts 
are  zero,  and  thus  the  circuit  graph  is  acyclic.  We  can  consider  the  circuit  to  have  one  inptr 
interface  vt  and  one  output  interface  v0-  By  retiming  a  combinational  circuit  G,  we  can 
produce  a  pipelined  circuit  Gr  with  achieves  a  shorter  clock  period  at  the  cost  of  introducing 
a  latency  of  r(vo)  —  r(w/)  clock  ticks  for  signals  to  propagate  from  the  input  interface  vj  to 
the  output  interface  t ;<j. 

In  the  optimal  pipelining  problem ,  we  are  given  a  combinational  circuit  G  and  an  nonneg¬ 
ative  integer  1  and  asked  to  produce  a  retimed  circuit  G,  with  minimum  clock  period  subject 
to  the  constraint  that  the  retimed  circuit  have  latency  at  most  l.  This  problem  is  just  the 

clock  period  minimization  problem  with  the  additional  constraint  that  r(u)  —  r(v)  <  l.  This 

< 

additional  constraint  can  be  modeled  by  augmenting  the  circuit  G  with  an  edge  vo  —*■  e'v/ 
having  weight  tu(e)  =  l.  Also,  the  methods  of  Section  8  can  be  applied  to  solve  the  problem 
of  minimizing  the  state  of  pipelined  circuitry  subject  to  upper  bounds  on  clock  period  and 
latency. 

Timing  at  external  interfaces.  An  external  interface  may  be  forced  to  meet  various 
timing  specifications.  For  instance,  if  an  external  interface  has  a  known  time  delay  between 
the  time  at  which  it  receives  outputs  from  the  circuit  and  the  time  at  which  it  presents  inputs, 
the  external  vertex  can  be  assigned  a  propagation  delay  greater  than  0.  By  augmenting  the 
set  of  inequalities  specified  in  Theorem  10,  it  is  often  possible  for  the  optimization  algorithms 
to  act  subject  to  other  constraints.  If  data  must  be  available  to  an  interface  along  some  edge 
v  A  v0  within  some  time  t  after  each  clock  tick,  for  example,  we  can  express  this  by  the 
inequality 

R(v)  -  r(v0)  <  -  +  ts(e), 
c 

where  v0  is  the  vertex  representing  the  interface.  This  constraint  is  equivalent  to  saying  that 
we  must  have  A(v)  <  t  if  the  register  count  tvr(e)  of  edge  e  is  zero  in  the  retimed  circuit. 
Similarly,  if  data  from  the  interface  is  not  available  on  an  edge  vo  A  v  until  some  time  t  after 
each  clock  tick,  this  constraint  can  be  expressed  by  the  inequality 

r(w0)  -  #(v)  <  u>(e)  -  -  -. 

c  c 

Geometric  considerations.  The  optimization  methods  discussed  in  this  paper  can  be 
applied  largely  independently  of  geometric  considerations  because  although  retiming  causes 
the  addition  and  deletion  of  registers,  it  otherwise  leaves  the  functional  elements  and  their  pat¬ 
tern  of  interconnection  the  same.  Thus  if  a  given  circuit  has  an  area-efficient  layout,  chances 
are  that  a  retimed  version  of  the  circuit  can  be  laid  out  efficiently.  In  some  cases,  however, 
the  floorplan  of  a  circuit  may  limit  the  number  of  registers  on  certain  interconnections. 

The  inequalities  that  constrain  the  retimed  system  can  be  augmented  to  express  these 
geometric  constraints.  For  example,  to  specify  an  upper  bound  k  on  the  number  of  registers 
that  can  fit  along  some  edge  u  A  v,  we  can  impose  the  constraint 


r(v)  —  r(u)  <k  —  w(e). 


We  can  also  model  a  equation  in  which  the  first  k  registers  on  an  edge  u  A  v  are  relatively 
cheap  and  additional  registers  are  more  expensive.  Add  an  auxiliary  vertex  u  in  the  middle 
of  edge  e.  Then  assign  a  high  cost  to  registers  on  the  edge  u  — ►  u  and  a  low  cost  to  registers 
on  the  edge  &  v,  but  constrain  4  — ►  t>  to  have  at  most  k  registers  in  the  retimed  system. 
Solve  the  system  of  constraints  as  in  Section  8.  On  the  other  hand,  if  the  first  register 
on  a  connection  is  expensive  and  additional  registers  are  cheap,  then  it  is  NP-complete  to 
determine  whether  a  circuit  can  be  retimed  to  achieve  a  specified  bound  on  register  coet  [17, 
pp.  182-183]. 

Slowdown.  In  Section  7  we  showed  how  a  e-slow  circuit,  which  supports  c  independent 
streams  of  computation,  can  be  reduced  to  support  a  single  stream  of  computation  by  remov¬ 
ing  registers.  The  notion  of  c-slow  circuitry  offers  new  insight  into  many  circuit  designs  that 
are  not  technically  c-slow.  Consider,  for  example,  a  2-slow  circuit  in  which  only  one  stream 
of  computation  is  being  used.  The  registers  in  the  circuit  fall  into  two  equivalence  classes, 
one  of  which  is  idle  during  each  clock  period.  Using  Algorithm  R  to  remove  all  the  registers 
in  one  equivalence  class  is  one  way  to  optimize  such  a  circuit. 

Another  way  to  save  registers  is  to  modify  the  functional  elements  to  perform  slightly 
different  actions  on  even  and  odd  time  steps  so  that  each  physical  register  plays  the  roles  of 
two  logical  registers,  one  in  each  equivalence  class.  A  cursory  examination  of  the  resulting 
circuit  would  not  reveal  that  it  is  2-slow  according  to  the  circuit  model,  but  it  would  never¬ 
theless  communicate  with  the  host  only  on  every  other  clock  tick.  Although  this  method  for 
saving  registers  may  sometimes  be  acceptable,  the  overhead  of  register  multiplexing  and  the 
complexity  of  control  suggest  that  Algorithm  R  is  a  more  reasonable  alternative.  Moreover, 
when  confronted  with  a  circuit  that  communicates  externally  only  on  every  other  clock  tick 
or  a  circuit  whose  functional  elements  perform  different  operations  on  alternate  clock  ticks, 
we  may  suspect  that  it  is  really  a  2-slow  circuit  in  disguise,  and  that  penetrating  the  disguise 
might  lead  to  improved  performance  and  simplification  of  the  control  logic. 

Data-dependent  propagation  delays.  A  major  deficiency  of  the  circuit  model  is  its 
inability  to  represent  combinational  logic  elements  with  data-dependent  propagation  delays. 
For  example,  if  a  multiplier  can  produce  an  answer  quickly  whenever  one  of  its  inputs  is  zero, 
its  propagation  delay  is  data  dependent.  One  would  like  to  take  advantage  of  the  shorter 
delay  whenever  possible  in  order  to  speed  a  larger  computation. 

While  we  are  unable  to  model  data  dependence  in  the  general  case,  we  can  sometimes 
use  the  extended  circuit  model  of  Section  9  to  partially  model  the  effects.  As  an  example, 
in  nMOS  circuits  [14],  the  transition  of  a  Boolean  signal  from  0  to  1  can  take  much  longer 
than  the  transition  of  the  same  signal  from  1  to  0.  We  can  model  this  situation  somewhat 
by  representing  each  wire  as  two  edges  in  the  graph,  one  representing  the  value  0  on  the  wire 
and  the  other  representing  1.  We  choose  propagation  delays  for  internal  edges  of  a  functional 
element  depending  on  how  0  or  1  inputs  affect  the  output.  Unfortunately,  we  cannot  model 
how  the  delays  affect  the  clock  period  exactly,  but  upper  bounds  can  be  obtained  which  will, 
for  example,  properly  model  the  propagation  delay  through  two  cascaded  inverters. 

There  is  much  more  to  be  understood  about  clocked  circuits.  Do  powerful  combinatorial 
optimization  techniques  apply  to  other  timing  models  such  as  those  involving  multiphase 


clocking  disciplines?  Can  datardependent  propagation  delays  be  handled  in  a  reasonably 
general  setting?  Is  it  possible  to  solve  the  state  minimisation  problem  with  a  polynomial¬ 
time  algorithm  that  is  simpler  than  the  typical  algorithms  for  solving  minimum-cost  flow 
problems?  Can  hierarchically  described  circuits  be  optimally  retimed  in  time  proportional  to 
their  descriptions?  Under  what  circumstances  can  optimal  retimings  of  parametrized  families 
of  circuits  be  algorithmically  obtained? 

Retiming  is  a  transformation  that  can  be  used  to  produce  efficient  circuits,  and  we  have 
presented  a  variety  of  algorithms  for  automatically  retiming  circuits.  Of  great  interest,  how¬ 
ever,  are  design  methodologies  in  which  retiming  is  performed  by  an  individual  instead  of  a 
algorithm,  as  is  done  in  [9],  [10],  [12],  and  [17].  Retiming  seems  to  be  a  valuable  technique 
which  could  be  incorporated  into  both  circuit  compilers  and  interactive  design  tools. 
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